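I Tried Implementing Spot Instance Interruption Handling Behind an ELB (ALB) with Step Functions
In an environment that runs Spot Instances behind an ELB (ALB), I implemented automated handling of instance interruption notices with AWS Step Functions.
By using the Step Functions AWS SDK service integrations, it works without any Lambda functions.
Process overview
Initialization
- Get the instance ID from the input
Retrieving Auto Scaling group information
- Get EC2 instance information (GetEC2InstanceInfo)
- Retrieve the details of the specified instance ID
- Move to HandleError if an error occurs
- Extract tags (ExtractTags)
- Extract the Auto Scaling group name tag from the instance
- Validate the tag (ValidateTag)
- Check that the Auto Scaling group name tag exists
- Retrieve the details of the Auto Scaling group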
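The initialization and tag lookup steps above can be sketched in plain Python. This is only an illustration of the logic, not the state machine itself: the function names and the sample data are hypothetical, and the dictionaries are assumed to have the same shape as the EventBridge event and the `DescribeInstances` response.

```python
from typing import Optional

# Tag that EC2 Auto Scaling attaches to the instances it launches.
ASG_TAG_KEY = "aws:autoscaling:groupName"


def extract_instance_id(event: dict) -> str:
    """Return the instance ID carried in the interruption warning event."""
    return event["detail"]["instance-id"]


def find_asg_name(describe_instances_response: dict) -> Optional[str]:
    """Return the Auto Scaling group name tag value, or None if it is absent."""
    instance = describe_instances_response["Reservations"][0]["Instances"][0]
    for tag in instance.get("Tags", []):
        if tag["Key"] == ASG_TAG_KEY:
            return tag["Value"]
    return None


# Hypothetical sample data for illustration only.
sample_event = {"detail": {"instance-id": "i-1234567890abcdef0"}}
sample_response = {
    "Reservations": [
        {
            "Instances": [
                {
                    "InstanceId": "i-1234567890abcdef0",
                    "Tags": [
                        {"Key": "Name", "Value": "web"},
                        {"Key": ASG_TAG_KEY, "Value": "example-asg"},
                    ],
                }
            ]
        }
    ]
}

instance_id = extract_instance_id(sample_event)
asg_name = find_asg_name(sample_response)
print(instance_id, asg_name)
```

When the tag is missing, `find_asg_name` returns `None`, which corresponds to the ValidateTag step routing to HandleError.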
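Parallel processing (ParallelProcessing)
A. Target group branch
- Extract the target group ARNs
- Retrieve the target group information
- Deregister the instance from the target group
B. Auto Scaling group branch
- Calculate the new desired capacity (current capacity + 1)
- Compare it with the maximum size
- If the condition is met, update the Auto Scaling group
- Merge the results (MergeResults)
Completion
- Success state on success
- HandleError state on error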
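The capacity check in branch B and the final merge can be sketched as follows. This is a minimal illustration in plain Python under my own naming (the function names are hypothetical); it mirrors the "current capacity + 1, compared against the maximum size" rule and the shape of the merged result.

```python
def plan_capacity_update(desired_capacity: int, max_size: int) -> dict:
    """Decide whether the Auto Scaling group may grow by one instance."""
    new_desired = desired_capacity + 1
    return {
        "NewDesiredCapacity": new_desired,
        # Only update when the new value does not exceed the maximum size.
        "ShouldUpdate": new_desired <= max_size,
    }


def merge_results(instance_id: str, target_group_result: dict,
                  auto_scaling_result: dict) -> dict:
    """Combine the outputs of the two parallel branches into one document."""
    return {
        "InstanceId": instance_id,
        "FinalState": {
            "TargetGroupOperations": target_group_result,
            "AutoScalingOperations": auto_scaling_result,
        },
    }


# Example values: a group at 2 of 6 may grow, a group at 6 of 6 may not.
print(plan_capacity_update(2, 6))
print(plan_capacity_update(6, 6))
```

Raising the desired capacity before the instance is reclaimed lets the replacement launch start early, while the maximum size check keeps the group within its configured limit.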
Step Functions Workflow Studio
CloudFormation
I deployed the EventBridge rule and the Step Functions state machine with CloudFormation.
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Spot Instance Interruption Handler with ALB and Auto Scaling Integration'
Resources:
  SpotInterruptionEventRule:
    Type: AWS::Events::Rule
    Properties:
      Description: "Capture Spot Instance Interruption Warnings"
      EventPattern:
        source:
          - aws.ec2
        detail-type:
          - EC2 Spot Instance Interruption Warning
      State: "ENABLED"
      Targets:
        - Arn: !Ref SpotInterruptionStateMachine
          Id: "SpotInterruptionStateMachine"
          RoleArn: !GetAtt EventBridgeRole.Arn
  EventBridgeRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: events.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: InvokeStepFunction
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: states:StartExecution
                Resource: !Ref SpotInterruptionStateMachine
  SpotInterruptionStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      StateMachineType: EXPRESS
      LoggingConfiguration:
        Level: ALL
        IncludeExecutionData: true
        Destinations:
          - CloudWatchLogsLogGroup:
              LogGroupArn: !GetAtt StepFunctionsLogGroup.Arn
      DefinitionString: !Sub |
        {
          "Comment": "Handle Spot Instance Interruption",
          "StartAt": "PrepareInstanceId",
          "States": {
            "PrepareInstanceId": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.detail.instance-id"
              },
              "Next": "GetEC2InstanceInfo"
            },
            "GetEC2InstanceInfo": {
              "Type": "Task",
              "Resource": "arn:aws:states:::aws-sdk:ec2:describeInstances",
              "Parameters": {
                "InstanceIds.$": "States.Array($.InstanceId)"
              },
              "ResultPath": "$.InstanceInfo",
              "Next": "ExtractTags",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "Next": "HandleError"
                }
              ]
            },
            "ExtractTags": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "Tags.$": "$.InstanceInfo.Reservations[0].Instances[0].Tags[?(@.Key == 'aws:autoscaling:groupName')]"
              },
              "Next": "ValidateTag"
            },
            "ValidateTag": {
              "Type": "Choice",
              "Choices": [
                {
                  "And": [
                    {
                      "Variable": "$.Tags[0].Value",
                      "IsPresent": true
                    }
                  ],
                  "Next": "PrepareAutoScalingGroupName"
                }
              ],
              "Default": "HandleError"
            },
            "PrepareAutoScalingGroupName": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "AutoScalingGroupName.$": "$.Tags[0].Value"
              },
              "Next": "GetAutoScalingGroupInfo"
            },
            "GetAutoScalingGroupInfo": {
              "Type": "Task",
              "Resource": "arn:aws:states:::aws-sdk:autoscaling:describeAutoScalingGroups",
              "Parameters": {
                "AutoScalingGroupNames.$": "States.Array($.AutoScalingGroupName)"
              },
              "ResultPath": "$.AutoScalingGroupInfo",
              "Next": "ParallelProcessing",
              "Catch": [
                {
                  "ErrorEquals": ["States.ALL"],
                  "Next": "HandleError"
                }
              ]
            },
            "ParallelProcessing": {
              "Type": "Parallel",
              "Branches": [
                {
                  "StartAt": "ExtractTargetGroupARNs",
                  "States": {
                    "ExtractTargetGroupARNs": {
                      "Type": "Pass",
                      "Parameters": {
                        "InstanceId.$": "$.InstanceId",
                        "TargetGroupARNs.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].TargetGroupARNs",
                        "AutoScalingGroupInfo.$": "$.AutoScalingGroupInfo"
                      },
                      "Next": "GetTargetGroupInfo"
                    },
                    "GetTargetGroupInfo": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:elasticloadbalancingv2:describeTargetGroups",
                      "Parameters": {
                        "TargetGroupArns.$": "$.TargetGroupARNs"
                      },
                      "ResultPath": "$.TargetGroupInfo",
                      "Next": "ValidateTargetGroups",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.ErrorInfo",
                          "Next": "HandleTargetGroupError"
                        }
                      ]
                    },
                    "ValidateTargetGroups": {
                      "Type": "Choice",
                      "Choices": [
                        {
                          "And": [
                            {
                              "Variable": "$.TargetGroupInfo.TargetGroups[0]",
                              "IsPresent": true
                            }
                          ],
                          "Next": "DeregisterTarget"
                        }
                      ],
                      "Default": "HandleTargetGroupError"
                    },
                    "DeregisterTarget": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:elasticloadbalancingv2:deregisterTargets",
                      "Parameters": {
                        "TargetGroupArn.$": "$.TargetGroupInfo.TargetGroups[0].TargetGroupArn",
                        "Targets": [
                          {
                            "Id.$": "$.InstanceId"
                          }
                        ]
                      },
                      "ResultPath": "$.DeregisterResult",
                      "Next": "TargetGroupSuccessDeregistered",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.ErrorInfo",
                          "Next": "HandleTargetGroupError"
                        }
                      ]
                    },
                    "HandleTargetGroupError": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Error",
                        "Error": {
                          "Message": "Target group operation failed",
                          "InstanceId.$": "$.InstanceId"
                        }
                      },
                      "End": true
                    },
                    "TargetGroupSuccessDeregistered": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Success",
                        "InstanceId.$": "$.InstanceId",
                        "DeregisterResult.$": "$.DeregisterResult"
                      },
                      "End": true
                    }
                  }
                },
                {
                  "StartAt": "PrepareCapacityComparison",
                  "States": {
                    "PrepareCapacityComparison": {
                      "Type": "Pass",
                      "Parameters": {
                        "InstanceId.$": "$.InstanceId",
                        "AutoScalingGroupInfo.$": "$.AutoScalingGroupInfo",
                        "NewDesiredCapacity.$": "States.MathAdd($.AutoScalingGroupInfo.AutoScalingGroups[0].DesiredCapacity, 1)",
                        "MaxSize.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].MaxSize"
                      },
                      "Next": "CheckScalingCondition"
                    },
                    "CheckScalingCondition": {
                      "Type": "Choice",
                      "Choices": [
                        {
                          "And": [
                            {
                              "Variable": "$.NewDesiredCapacity",
                              "NumericLessThanEqualsPath": "$.MaxSize"
                            }
                          ],
                          "Next": "UpdateAutoScalingGroup"
                        }
                      ],
                      "Default": "ASGSkipped"
                    },
                    "UpdateAutoScalingGroup": {
                      "Type": "Task",
                      "Resource": "arn:aws:states:::aws-sdk:autoscaling:updateAutoScalingGroup",
                      "Parameters": {
                        "AutoScalingGroupName.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].AutoScalingGroupName",
                        "DesiredCapacity.$": "$.NewDesiredCapacity"
                      },
                      "ResultPath": "$.UpdateASGResult",
                      "Catch": [
                        {
                          "ErrorEquals": ["States.ALL"],
                          "ResultPath": "$.UpdateASGError",
                          "Next": "ASGWithError"
                        }
                      ],
                      "Next": "ASGSuccess"
                    },
                    "ASGWithError": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Warning",
                        "InstanceId.$": "$.InstanceId",
                        "Error.$": "$.UpdateASGError",
                        "Message": "Auto Scaling Group update failed but continuing execution"
                      },
                      "End": true
                    },
                    "ASGSuccess": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Success",
                        "InstanceId.$": "$.InstanceId",
                        "AutoScalingInfo": {
                          "CurrentDesiredCapacity.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].DesiredCapacity",
                          "MaxSize.$": "$.MaxSize",
                          "RequestedDesiredCapacity.$": "$.NewDesiredCapacity"
                        }
                      },
                      "End": true
                    },
                    "ASGSkipped": {
                      "Type": "Pass",
                      "Parameters": {
                        "Status": "Success",
                        "InstanceId.$": "$.InstanceId",
                        "AutoScalingInfo": {
                          "CurrentDesiredCapacity.$": "$.AutoScalingGroupInfo.AutoScalingGroups[0].DesiredCapacity",
                          "MaxSize.$": "$.MaxSize",
                          "RequestedDesiredCapacity.$": "$.NewDesiredCapacity",
                          "UpdateSkipped": {
                            "Reason.$": "States.Format('Requested capacity {} exceeds max size {}', $.NewDesiredCapacity, $.MaxSize)"
                          }
                        }
                      },
                      "End": true
                    }
                  }
                }
              ],
              "ResultPath": "$.ParallelResults",
              "Next": "MergeResults"
            },
            "MergeResults": {
              "Type": "Pass",
              "Parameters": {
                "InstanceId.$": "$.InstanceId",
                "FinalState": {
                  "TargetGroupOperations.$": "$.ParallelResults[0]",
                  "AutoScalingOperations.$": "$.ParallelResults[1]"
                }
              },
              "Next": "Success"
            },
            "HandleError": {
              "Type": "Fail",
              "Error": "SpotInterruptionHandlingError",
              "Cause": "Error occurred during spot interruption handling"
            },
            "Success": {
              "Type": "Succeed"
            }
          }
        }
      RoleArn: !GetAtt StepFunctionsExecutionRole.Arn
  StepFunctionsExecutionRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: states.amazonaws.com
            Action: sts:AssumeRole
      Policies:
        - PolicyName: EC2AndAutoScalingAccess
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - ec2:DescribeInstances
                  - autoscaling:DescribeAutoScalingGroups
                  - autoscaling:UpdateAutoScalingGroup
                  - elasticloadbalancing:DescribeTargetGroups
                  - elasticloadbalancing:DescribeTargetHealth
                  - elasticloadbalancing:DeregisterTargets
                Resource: "*"
        - PolicyName: CloudWatchLogsAccess
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - logs:CreateLogDelivery
                  - logs:GetLogDelivery
                  - logs:UpdateLogDelivery
                  - logs:DeleteLogDelivery
                  - logs:ListLogDeliveries
                  - logs:PutResourcePolicy
                  - logs:DescribeResourcePolicies
                  - logs:DescribeLogGroups
                Resource: "*"
  StepFunctionsLogGroup:
    Type: AWS::Logs::LogGroup
    Properties:
      LogGroupName: !Sub "/aws/states/${AWS::StackName}-spotinterruption"
      RetentionInDays: 180
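To check by hand which events the rule in this template forwards, the following Python sketch applies the rule's event pattern to a sample event. The matcher is a deliberately simplified stand-in for EventBridge pattern matching (top-level keys with lists of allowed values only), and the sample event, including its account ID and instance ID, is made-up illustration data shaped like a Spot interruption warning.

```python
import json

# Same pattern as the EventPattern in the template above.
RULE_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EC2 Spot Instance Interruption Warning"],
}

# Hypothetical sample event; all IDs are placeholders.
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "EC2 Spot Instance Interruption Warning",
    "source": "aws.ec2",
    "account": "123456789012",
    "region": "ap-northeast-1",
    "detail": {
        "instance-id": "i-1234567890abcdef0",
        "instance-action": "terminate",
    },
}


def matches_rule(event: dict, pattern: dict) -> bool:
    """Simplified matcher: each pattern key must list the event's value."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


if matches_rule(SAMPLE_EVENT, RULE_PATTERN):
    # This JSON document is what the state machine receives as its input.
    print(json.dumps(SAMPLE_EVENT, indent=2))
```

The printed JSON can also be pasted as the execution input when starting the state machine manually, which exercises the workflow without waiting for a real interruption.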
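Verification
Building the test environment
To evaluate the Spot Instance interruption scenario, I built the following components with CloudFormation:
- Application Load Balancer (ALB)
- Auto Scaling group (configured for Spot Instances)
- Security groups
- Basic web server setup (launch template)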
AWSTemplateFormatVersion: '2010-09-09'
Description: 'Spot Instance EC2 Autoscaling with ALB'
Parameters:
  VPCID:
    Type: AWS::EC2::VPC::Id
    Description: Select the VPC where you want to deploy the resources
  EC2SubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
    Description: Select subnets for EC2 instances and ALB
  KeyName:
    Description: Name of an existing EC2 KeyPair to enable SSH access to the instances
    Type: AWS::EC2::KeyPair::KeyName
    ConstraintDescription: must be the name of an existing EC2 KeyPair.
  LatestAmiId:
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /aws/service/ami-amazon-linux-latest/al2023-ami-kernel-6.1-x86_64
Resources:
  ALBSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for ALB
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          CidrIp: 0.0.0.0/0
  EC2SecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for EC2 instances
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 80
          ToPort: 80
          SourceSecurityGroupId: !Ref ALBSecurityGroup
  ApplicationLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      SecurityGroups:
        - !Ref ALBSecurityGroup
      Subnets: !Ref EC2SubnetIds
      Type: application
  ALBListener:
    Type: AWS::ElasticLoadBalancingV2::Listener
    Properties:
      DefaultActions:
        - Type: forward
          TargetGroupArn: !Ref ALBTargetGroup
      LoadBalancerArn: !Ref ApplicationLoadBalancer
      Port: 80
      Protocol: HTTP
  ALBTargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      HealthCheckPath: /
      Name: my-alb-target-group
      Port: 80
      Protocol: HTTP
      TargetType: instance
      VpcId: !Ref VPCID
  Ec2InstanceLaunchTemplate:
    Type: AWS::EC2::LaunchTemplate
    Properties:
      LaunchTemplateName: EC2AutoScalingLaunchTemplate
      LaunchTemplateData:
        SecurityGroupIds:
          - !Ref EC2SecurityGroup
        InstanceInitiatedShutdownBehavior: terminate
        KeyName: !Ref 'KeyName'
        ImageId: !Ref LatestAmiId
        InstanceType: t3.nano
        UserData:
          Fn::Base64: !Sub |
            #!/bin/bash
            yum install -y httpd
            systemctl start httpd
            systemctl enable httpd
            echo "<h1>Hello from $(hostname -f)</h1>" > /var/www/html/index.html
  Ec2InstanceAutoScalingGroup:
    Type: AWS::AutoScaling::AutoScalingGroup
    Properties:
      VPCZoneIdentifier: !Ref EC2SubnetIds
      MinSize: 2
      MaxSize: 6
      DesiredCapacity: 2
      HealthCheckType: ELB
      HealthCheckGracePeriod: 300
      TargetGroupARNs:
        - !Ref ALBTargetGroup
      MixedInstancesPolicy:
        InstancesDistribution:
          OnDemandAllocationStrategy: prioritized
          OnDemandBaseCapacity: 0
          OnDemandPercentageAboveBaseCapacity: 0
          SpotAllocationStrategy: capacity-optimized
        LaunchTemplate:
          LaunchTemplateSpecification:
            LaunchTemplateId: !Ref 'Ec2InstanceLaunchTemplate'
            Version: !GetAtt 'Ec2InstanceLaunchTemplate.LatestVersionNumber'
          Overrides:
            - InstanceType: t3a.nano
            - InstanceType: t3.nano
Running the interruption test
I used AWS FIS (Fault Injection Simulator) to simulate a Spot Instance interruption.
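The post does not show the experiment definition I used, so the following is only a rough sketch of what such an FIS experiment template could look like, written as a Python dictionary. The helper name, the role ARN, the group name, and the tag-based target selection are all assumptions for illustration; the action ID `aws:ec2:send-spot-instance-interruptions` and its `durationBeforeInterruption` parameter are the pieces this sketch is built around, and I have not run this exact definition.

```python
def build_spot_interruption_template(role_arn: str, asg_name: str,
                                     notice: str = "PT2M") -> dict:
    """Build a request body shaped like an FIS experiment template.

    Assumption: one Spot Instance is selected by its Auto Scaling group
    tag and interrupted after the given ISO 8601 notice period.
    """
    return {
        "description": "Interrupt one Spot Instance in the test Auto Scaling group",
        "roleArn": role_arn,
        "stopConditions": [{"source": "none"}],
        "targets": {
            "spotInstance": {
                "resourceType": "aws:ec2:spot-instance",
                "resourceTags": {"aws:autoscaling:groupName": asg_name},
                "selectionMode": "COUNT(1)",
            }
        },
        "actions": {
            "interruptSpot": {
                "actionId": "aws:ec2:send-spot-instance-interruptions",
                "parameters": {"durationBeforeInterruption": notice},
                # Maps the action's target key to the target defined above.
                "targets": {"SpotInstances": "spotInstance"},
            }
        },
    }


# Placeholder values; replace them with a real FIS role and group name.
template = build_spot_interruption_template(
    role_arn="arn:aws:iam::123456789012:role/example-fis-role",
    asg_name="example-asg",
)
print(template["actions"]["interruptSpot"]["actionId"])
```

A dictionary like this could be passed to an FIS client to create the template; when the experiment starts, the selected instance receives the interruption warning that triggers the EventBridge rule.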
Results
Performance evaluation
- Spot interruption warning issued: 13:55:59
"detail-type": "EC2 Spot Instance Interruption Warning",
"source": "aws.ec2",
"time": "2024-10-23T13:55:59Z",
- Target deregistration executed: 13:56:01.525 (about 2.5 seconds later)
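The delay between the two timestamps above can be checked with a few lines of Python. This assumes both timestamps are in UTC on the same day and that the warning arrives with the standard two-minute notice for Spot interruptions.

```python
from datetime import datetime, timezone

# Timestamps taken from the results above (assumed to be UTC, same day).
warning_at = datetime(2024, 10, 23, 13, 55, 59, tzinfo=timezone.utc)
deregistered_at = datetime(2024, 10, 23, 13, 56, 1, 525000, tzinfo=timezone.utc)

# Time from the warning to the deregistration call.
elapsed = (deregistered_at - warning_at).total_seconds()

# Time left for connection draining, assuming a 120-second notice.
remaining = 120 - elapsed

print(f"elapsed: {elapsed:.3f} s, remaining: {remaining:.3f} s")
```

Under these assumptions, the deregistration starts roughly 2.5 seconds after the warning, which leaves most of the notice period for in-flight requests to drain.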
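Graph view
Table view
The state machine execution took 0.474 seconds.
Summary
Spot Instance interruption handling can also be implemented in a similar way with Lambda, but using Step Functions this time let me strengthen the error handling and the parallel processing.
In addition, with Lambda you periodically have to update the runtime, whereas the Step Functions AWS SDK service integrations need no such maintenance.
If you run Spot Instances launched by Auto Scaling behind an ELB (ALB) and need to prepare for interruption notices, please give this setup a try.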